Goto

Collaborating Authors

 execution cost


AReaL-Hex: Accommodating Asynchronous RL Training over Heterogeneous GPUs

Yan, Ran, Jiang, Youhe, Wu, Tianyuan, Gao, Jiaxuan, Mei, Zhiyu, Fu, Wei, Mai, Haohui, Wang, Wei, Wu, Yi, Yuan, Binhang

arXiv.org Artificial Intelligence

Maximizing training throughput and cost-efficiency of RL for LLMs is essential to democratize this advanced technique. One promising but challenging approach is to deploy such a computational workflow over heterogeneous GPUs. Unlike conventional large-scale LLM pretraining, RL training generally decomposes into three coupled stages, i.e., rollout generation, reward computation, and policy/value updates, which exhibit markedly different compute intensities, memory footprints, and communication patterns. Recent research shows that fully asynchronous RL training can disaggregate these stages across disjoint hardware pools without sacrificing training stability, creating a great opportunity for real-world heterogeneous deployment. To this end, we present AReaL-Hex, a heterogeneity-aware asynchronous RL training system that effectively schedules how to execute rollout generation and policy model training over heterogeneous GPUs while enforcing data staleness bounds. Concretely, we use a two-phase scheduler: (i) a constrained search with MILP to select per-stage parallelization strategies and workload assignments given a resource budget, and (ii) a graph-partitioning step that allocates heterogeneous GPUs and interconnects to maximize end-to-end throughput. Built atop a fully asynchronous RL architecture, AReaL-Hex maps HBM-I/O-bound generation and compute-bound optimization to more cost-efficient resources and balances their producer-consumer interactions to avoid both idleness and stale rollout trajectories. On the mathematical reasoning task with various model scales (1.5B, 7B, and 14B), compared to homogeneous deployments of state-of-the-art asynchronous RL systems: (i) When maintaining the same total budgets, AReaL-Hex delivers up to 1.50x higher training throughput; (ii) When achieving the same training throughput, AReaL-Hex results in up to 1.46x reduction in training cost.


A Holistic Architecture for Monitoring and Optimization of Robust Multi-Agent Path Finding Plan Execution

Zahrádka, David, Mužíková, Denisa, Woller, David, Kulich, Miroslav, Švancara, Jiří, Barták, Roman

arXiv.org Artificial Intelligence

The goal of Multi-Agent Path Finding (MAPF) is to find a set of paths for a fleet of agents moving in a shared environment such that the agents reach their goals without colliding with each other. In practice, some of the robots executing the plan may get delayed, which can introduce collision risk. Although robust execution methods are used to ensure safety even in the presence of delays, the delays may still have a significant impact on the duration of the execution. At some point, the accumulated delays may become significant enough that instead of continuing with the execution of the original plan, even if it was optimal, there may now exist an alternate plan which will lead to a shorter execution. However, the problem is how to decide when to search for the alternate plan, since it is a costly procedure. In this paper, we propose a holistic architecture for robust execution of MAPF plans, its monitoring and optimization. We exploit a robust execution method called Action Dependency Graph to maintain an estimate of the expected execution duration during the plan's execution. This estimate is used to predict the potential that finding an alternate plan would lead to shorter execution. We empirically evaluate the architecture in experiments in a real-time simulator which we designed to mimic our real-life demonstrator of an autonomous warehouse robotic fleet.


Speedup Techniques for Switchable Temporal Plan Graph Optimization

Jiang, He, Lin, Muhan, Li, Jiaoyang

arXiv.org Artificial Intelligence

Multi-Agent Path Finding (MAPF) focuses on planning collision-free paths for multiple agents. However, during the execution of a MAPF plan, agents may encounter unexpected delays, which can lead to inefficiencies, deadlocks, or even collisions. To address these issues, the Switchable Temporal Plan Graph provides a framework for finding an acyclic Temporal Plan Graph with the minimum execution cost under delays, ensuring deadlock- and collision-free execution. Unfortunately, existing optimal algorithms, such as Mixed Integer Linear Programming and Graph-Based Switchable Edge Search (GSES), are often too slow for practical use. This paper introduces Improved GSES, which significantly accelerates GSES through four speedup techniques: stronger admissible heuristics, edge grouping, prioritized branching, and incremental implementation. Experiments conducted on four different map types with varying numbers of agents demonstrate that Improved GSES consistently achieves over twice the success rate of GSES and delivers up to a 30-fold speedup on instances where both methods successfully find solutions.


CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning

Wu, Duo, Wang, Jinghe, Meng, Yuan, Zhang, Yanning, Sun, Le, Wang, Zhi

arXiv.org Artificial Intelligence

Utilizing large language models (LLMs) for tool planning has emerged as a promising avenue for developing general AI systems, where LLMs automatically schedule external tools (e.g. vision models) to tackle complex tasks based on task descriptions. To push this paradigm toward practical applications, it is crucial for LLMs to consider tool execution costs (e.g. execution time) for tool planning. Unfortunately, prior studies overlook the tool execution costs, leading to the generation of expensive plans of which the costs outweigh task performance. To fill this gap, we propose the Cost-Aware Tool Planning with LLMs (CATP-LLM) framework, which for the first time provides a coherent design to empower LLMs for cost-aware tool planning. Specifically, CATP-LLM incorporates a tool planning language to enhance the LLM to generate non-sequential plans of multiple branches for efficient concurrent tool execution and cost reduction. Moreover, it further designs a cost-aware offline reinforcement learning algorithm to fine-tune the LLM to optimize the performance-cost trade-off in tool planning. In lack of public cost-related datasets, we further present OpenCATP, the first platform for cost-aware planning evaluation. Experiments on OpenCATP show that CATP-LLM outperforms GPT-4 even when using Llama2-7B as its backbone, with the average improvement of 28.2%-30.2% higher plan performance and 24.7%-45.8% lower costs even on the challenging planning tasks. The codes of CATP-LLM and OpenCATP will be publicly available.


Online Multi-Agent Pickup and Delivery with Task Deadlines

Makino, Hiroya, Ito, Seigo

arXiv.org Artificial Intelligence

Managing delivery deadlines in automated warehouses and factories is crucial for maintaining customer satisfaction and ensuring seamless production. This study introduces the problem of online multi-agent pickup and delivery with task deadlines (MAPD-D), which is an advanced variant of the online MAPD problem incorporating delivery deadlines. MAPD-D presents a dynamic deadline-driven approach that includes task deadlines, with tasks being added at any time (online), thus challenging conventional MAPD frameworks. To tackle MAPD-D, we propose a novel algorithm named deadline-aware token passing (D-TP). The D-TP algorithm is designed to calculate pickup deadlines and assign tasks while balancing execution cost and deadline proximity. Additionally, we introduce the D-TP with task swaps (D-TPTS) method to further reduce task tardiness, enhancing flexibility and efficiency via task-swapping strategies. Numerical experiments were conducted in simulated warehouse environments to showcase the effectiveness of the proposed methods. Both D-TP and D-TPTS demonstrate significant reductions in task tardiness compared to existing methods, thereby contributing to efficient operations in automated warehouses and factories with delivery deadlines.


Optimizing pre-scheduled, intermittently-observed MDPs

Zhong, Patrick, Rossi, Federico, Shell, Dylan A.

arXiv.org Artificial Intelligence

A challenging category of robotics problems arises when sensing incurs substantial costs. This paper examines settings in which a robot wishes to limit its observations of state, for instance, motivated by specific considerations of energy management, stealth, or implicit coordination. We formulate the problem of planning under uncertainty when the robot's observations are intermittent but their timing is known via a pre-declared schedule. After having established the appropriate notion of an optimal policy for such settings, we tackle the problem of joint optimization of the cumulative execution cost and the number of state observations, both in expectation under discounts. To approach this multi-objective optimization problem, we introduce an algorithm that can identify the Pareto front for a class of schedules that are advantageous in the discounted setting. The algorithm proceeds in an accumulative fashion, prepending additions to a working set of schedules and then computing incremental changes to the value functions. Because full exhaustive construction becomes computationally prohibitive for moderate-sized problems, we propose a filtering approach to prune the working set. Empirical results demonstrate that this filtering is effective at reducing computation while incurring only negligible reduction in quality. In summarizing our findings, we provide a characterization of the run-time vs quality trade-off involved.


Hubbard-Stratonovich Detector for Simple Trainable MIMO Signal Detection

Takabe, Satoshi, Abe, Takashi

arXiv.org Artificial Intelligence

Massive multiple-input multiple-output (MIMO) is a key technology used in fifth-generation wireless communication networks and beyond. Recently, various MIMO signal detectors based on deep learning have been proposed. Especially, deep unfolding (DU), which involves unrolling of an existing iterative algorithm and embedding of trainable parameters, has been applied with remarkable detection performance. Although DU has a lesser number of trainable parameters than conventional deep neural networks, the computational complexities related to training and execution have been problematic because DU-based MIMO detectors usually utilize matrix inversion to improve their detection performance. In this study, we attempted to construct a DU-based trainable MIMO detector with the simplest structure. The proposed detector based on the Hubbard--Stratonovich (HS) transformation and DU is called the trainable HS (THS) detector. It requires only $O(1)$ trainable parameters and its training and execution cost is $O(n^2)$ per iteration, where $n$ is the number of transmitting antennas. Numerical results show that the detection performance of the THS detector is better than that of existing algorithms of the same complexity and close to that of a DU-based detector, which has higher training and execution costs than the THS detector.


Prediction of Horizontal Data Partitioning Through Query Execution Cost Estimation

Arsov, Nino, Velinov, Goran, Dimovski, Aleksandar S., Koteska, Bojana, Sahpaski, Dragan, Kon-Popovska, Margina

arXiv.org Machine Learning

The excessively increased volume of data in modern data management systems demands an improved system performance, frequently provided by data distribution, system scalability and performance optimization techniques. Optimized horizontal data partitioning has a significant influence of distributed data management systems. An optimally partitioned schema found in the early phase of logical database design without loading of real data in the system and its adaptation to changes of business environment are very important for a successful implementation, system scalability and performance improvement. In this paper we present a novel approach for finding an optimal horizontally partitioned schema that manifests a minimal total execution cost of a given database workload. Our approach is based on a formal model that enables abstraction of the predicates in the workload queries, and are subsequently used to define all relational fragments. This approach has predictive features acquired by simulation of horizontal partitioning, without loading any data into the partitions, but instead, altering the statistics in the database catalogs. We define an optimization problem and employ a genetic algorithm (GA) to find an approximately optimal horizontally partitioned schema. The solutions to the optimization problem are evaluated using PostgreSQL's query optimizer. The initial experimental evaluation of our approach confirms its efficiency and correctness, and the numbers imply that the approach is effective in reducing the workload execution cost.


Dynamic Sampling Of Video To Imagery For Deep Learning

#artificialintelligence

While today's deep learning systems are able to natively analyze video, the large file sizes of high resolution movies present unique challenges in terms of storage space and computational requirements. Sampling them into sequences of still images not only allows for real-time processing of unlimited-length videos but opens the door for creative new applications like "video ngrams." The most straightforward way to sample a video into a sequence of still images is to use a fixed-rate time-based mechanism such as one frame per second. This kind of sampling is supported natively by most tools like ffmpeg and provides a simplistic and robust workflow. At the same time, it is highly inefficient, especially for videos where there is a lot of repetition. In the case of television news, a considerable portion of the airtime is devoted to motionless anchors sitting in an unchanging studio, meaning there can be quite literally thousands of nearly identical frames in a single broadcast.


Implicitly Coordinated Multi-Agent Path Finding under Destination Uncertainty: Success Guarantees and Computational Complexity

Nebel, Bernhard, Bolander, Thomas, Engesser, Thorsten, Mattmüller, Robert

Journal of Artificial Intelligence Research

In multi-agent path finding (MAPF), it is usually assumed that planning is performed centrally and that the destinations of the agents are common knowledge. We will drop both assumptions and analyze under which conditions it can be guaranteed that the agents reach their respective destinations using implicitly coordinated plans without communication. Furthermore, we will analyze what the computational costs associated with such a coordination regime are. As it turns out, guarantees can be given assuming that the agents are of a certain type. However, the implied computational costs are quite severe. In the distributed setting, we either have to solve a sequence of NP-complete problems or have to tolerate exponentially longer executions. In the setting with destination uncertainty, bounded plan existence becomes PSPACE-complete. This clearly demonstrates the value of communicating about plans before execution starts.